A Suggested Revision to the
Forthcoming 5th Edition of the
APA Publication Manual
Effect size. Because p values are confounded, joint functions of several study features, including effect size and sample size, they are not useful indices of study effects. As emphasized by the APA Task Force on Statistical Inference (Wilkinson & APA Task Force on Statistical Inference, 1999), "reporting and interpreting effect sizes in the context of previously reported effects is essential to good research" (p. 599, emphasis added).
Reporting effect sizes has three important benefits. First, reporting effects facilitates subsequent meta-analyses incorporating a given report. Second, effect size reporting creates a literature in which subsequent researchers can formulate more specific study expectations by integrating the effects reported in related prior studies. Third, and perhaps most important, interpreting the effect sizes in a given study facilitates the evaluation of how a study's results fit into the existing literature, allows explicit assessment of how similar or dissimilar results are across related studies, and can inform judgments about which study features contributed to similarities or differences in effects.
For these reasons the 1994 fourth edition of the Publication Manual "encouraged" (p. 18) effect size reporting. However, 11 empirical studies of one or two post-1994 volumes of 23 journals found that this admonition had little, if any, impact (Vacha-Haase, Nilsson, Reetz, Lance & Thompson, 2000).
The reasons why the "encouragement" was ineffective, as reflected in the literature summary presented by Vacha-Haase et al. (2000), appear to be clear. As Thompson (1999) noted, only "encouraging" effect size reporting
presents a self-canceling mixed-message. To present an "encouragement" in the context of strict absolute standards regarding the esoterics of author note placement, pagination, and margins is to send the message, "these myriad requirements count, this encouragement doesn't." (p. 162)
Consequently, this edition of the Publication Manual incorporates as a requirement, "Always provide some effect-size estimate when reporting a p value" (Wilkinson & APA Task Force on Statistical Inference, 1999, p. 599, emphasis added).
In classical statistics, effect sizes characterize the fit of a model (e.g., a fixed-effects factorial ANOVA model) to data. Similarly, in structural equation modeling (SEM) goodness of fit indices may be thought of as effect sizes.
In a few analyses (e.g., randomization tests) effect size indices have not yet been formulated. However, confidence intervals are quite useful in these instances, just as they are even when effect sizes can be computed. Reporting confidence intervals, especially in direct comparison with the confidence intervals from related prior studies, falls squarely within the spirit of required effect size reporting. The graphic presentation of confidence intervals can be particularly helpful to readers.
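As a purely illustrative sketch of such a computation, the following Python fragment builds a confidence interval for a difference in two group means. The function name is hypothetical, and a large-sample normal approximation (±1.96 standard errors) stands in for the t critical value a complete analysis would use:

```python
from math import sqrt
from statistics import mean, stdev

def mean_diff_ci(group1, group2, z=1.96):
    # Normal-approximation 95% CI for a difference in two group means.
    # A complete analysis would use a t critical value; 1.96 is the
    # large-sample approximation, used here only for illustration.
    diff = mean(group1) - mean(group2)
    se = sqrt(stdev(group1) ** 2 / len(group1)
              + stdev(group2) ** 2 / len(group2))
    return diff - z * se, diff + z * se
```

Plotting such intervals side by side with intervals from related prior studies supports the comparative interpretation described above.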
Numerous effect sizes can be computed. Useful reviews of various choices are provided by Kirk (1996), Olejnik and Algina (2000), Rosenthal (1994), and Snyder and Lawson (1993). However, a brief review of the available choices may be useful. Although there is a class of effect sizes that Kirk (1996) labelled "miscellaneous" (e.g., the odds ratios that are so important in loglinear analyses), there are two major classes of effect sizes for parametric analyses.
The first class of effect sizes involves standardized mean differences. Effect sizes in this class include indices such as Glass' Δ, Hedges' g, and Cohen's d. For example, Glass' Δ is computed as the difference in the two means (i.e., experimental group mean minus control group mean) divided by the control group standard deviation, where the SD computation uses n - 1 in the denominator. When the study involves matched or repeated measures designs, the standardized difference is computed taking into account the correlation between measures (Dunlap, Cortina, Vaslow & Burke, 1996).
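As an illustrative sketch (the function names are hypothetical, not part of any standard), the standardized differences just described can be computed as:

```python
from math import sqrt
from statistics import mean, stdev

def glass_delta(experimental, control):
    # Glass' Delta: mean difference divided by the control-group SD,
    # with the SD computed using n - 1 (statistics.stdev does this).
    return (mean(experimental) - mean(control)) / stdev(control)

def cohens_d(group1, group2):
    # Cohen's d: mean difference divided by the pooled within-group SD.
    n1, n2 = len(group1), len(group2)
    pooled = sqrt(((n1 - 1) * stdev(group1) ** 2
                   + (n2 - 1) * stdev(group2) ** 2) / (n1 + n2 - 2))
    return (mean(group1) - mean(group2)) / pooled
```

The two indices agree when the group standard deviations are equal, and diverge as the control-group SD departs from the pooled SD.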
Of course, not all studies involve experiments or only a comparison of group means. Because all parametric analyses are part of one General Linear Model family, and are correlational, variance-accounted-for effect sizes can be computed in all studies, including both experimental and non-experimental studies. Effect sizes in this second class include indices such as r², R², and η². For example, for regression, R² can be computed as the sum-of-squares explained divided by the sum-of-squares total. Or, for a one-way ANOVA, η² is computed as the sum-of-squares explained divided by the sum-of-squares total.
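The one-way ANOVA computation just described can be sketched as follows (a purely illustrative Python fragment; the function name is hypothetical):

```python
from statistics import mean

def eta_squared(*groups):
    # Eta-squared for a one-way ANOVA:
    # sum-of-squares explained (between groups) / sum-of-squares total.
    scores = [x for g in groups for x in g]
    grand = mean(scores)
    ss_total = sum((x - grand) ** 2 for x in scores)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    return ss_between / ss_total
```

The same explained-over-total ratio underlies R² in regression, which is one expression of the General Linear Model commonality noted below.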
The General Linear Model is a powerful heuristic device (cf. Cohen, 1968), as suggested by commonalities in variance-accounted-for effect size formulas. However, in many applications it is advisable to convert these indices to unsquared metrics, for reasons summarized elsewhere (cf. D'Andrade & Dart, 1990; Ozer, 1985). When measures have intrinsically meaningful non-arbitrary metrics, as occasionally occurs in psychology, unstandardized effect indices may be more useful than standardized differences or variance-accounted-for or r statistics (Judd, McClelland & Culhane, 1995).
The effect sizes in these two classes--standardized differences and r--can be transformed into each other's metrics. For example, a Cohen's d can be converted to an r using Cohen's (1988, p. 23) formula #2.2.6:
r = d / √(d² + 4)

When total sample size is small or group sizes are disparate, it is advisable to use a slightly more complicated but more precise formula elaborated by Aaron, Kromrey and Ferron (1998):

r = d / √(d² + (N² − 2N) / (n₁n₂))
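The two conversion formulas above can be sketched in Python (illustrative function names only; note that the adjusted form converges toward the simple form as equal group sizes grow large):

```python
from math import sqrt

def d_to_r(d):
    # Cohen's (1988) formula #2.2.6: r = d / sqrt(d^2 + 4).
    return d / sqrt(d ** 2 + 4)

def d_to_r_adjusted(d, n1, n2):
    # Aaron, Kromrey and Ferron's (1998) more precise form, which
    # accounts for total N and for disparate group sizes.
    total = n1 + n2
    return d / sqrt(d ** 2 + (total ** 2 - 2 * total) / (n1 * n2))
```

With n₁ = n₂ = 500, for example, the two functions agree to roughly two decimal places, while small or unbalanced groups produce a visible discrepancy.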
Or an r can be converted to a d using Friedman's (1968, p. 246) formula #6:
d = 2r / √(1 − r²)

In addition to choosing between standardized difference and variance-accounted-for (or r) effect sizes, researchers must choose between "uncorrected" and "corrected" effect sizes. Like people, each individual sample has its own personality, or variance that is unique to that given sample. The effect sizes computed for a sample are inflated by capitalizing on this "sampling error variance."
However, we know what factors contribute to sampling error variance. Samples have more sampling error variance when (a) sample sizes are smaller, (b) the number of observed variables is larger, and (c) the population effect size is smaller. Because we know what factors contribute to sampling error variance, we can estimate the amount of positive bias in a variance-accounted-for effect size, and then estimate a "shrunken" or "corrected" effect size with the estimated sampling error variance removed. The "corrected" variance-accounted-for effect sizes include indices such as "adjusted R²," Hays' ω², and Herzberg's R².
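As one concrete example of such a correction, the statistic most regression programs report as "adjusted R²" shrinks the sample R² as a function of sample size n and the number of predictors k; the sketch below is illustrative only:

```python
def adjusted_r_squared(r2, n, k):
    # One common "shrunken" estimate, reported as "adjusted R2" by
    # most regression programs: 1 - (1 - R^2)(n - 1) / (n - k - 1),
    # where n is the sample size and k the number of predictors.
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)
```

Note that the shrinkage is larger when n is smaller, k is larger, or R² is smaller, mirroring the three sources of sampling error variance listed above.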
No one effect size is appropriate for all research situations. However, psychology as a field will be more fully informed by inquiry in which researchers report and interpret an effect size, whatever that index may be.
It should also be noted that Cohen (1988) provided rules of thumb for characterizing what effect sizes are small, medium, or large, as regards his impressions of the typicality of effects in the social sciences generally. However, he emphasized that the interpretation of effects requires the researcher to think more narrowly in terms of a specific area of inquiry. And the evaluation of effect sizes inherently requires an explicit researcher personal value judgment regarding the practical or clinical importance of the effects. Finally, it must be emphasized that if we mindlessly invoke Cohen's rules of thumb, contrary to his strong admonitions, in place of the equally mindless consultation of p value cutoffs such as .05 and .01, we are merely electing to be thoughtless in a new metric.